Exploratory Data Analysis Project

By Michael Bohn, Nate Wagner, Dominic Ventura

Completeness

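Most of our variables of interest were nearly complete. The table below shows the percentage of missing values for each variable. Only belongs_to_collection (88.8%), homepage (81.6%), and tagline (55.7%) are missing for more than half of the movies; every other variable is missing for about 2% of movies or fewer.

Variable Percent missing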
id 1.4061881
adult 1.4061881
belongs_to_collection 88.8165134
budget 1.4061881
genres 0.0000000
homepage 81.6139254
imdb_id 0.0374103
original_language 0.0242067
original_title 0.0000000
overview 2.0773734
popularity 1.4127899
poster_path 0.8208265
production_companies 0.0066018
production_countries 0.0066018
release_date 1.5932397
revenue 1.4127899
runtime 1.9717442
spoken_languages 1.4127899
status 1.5866379
tagline 55.7083755
title 1.4127899
video 1.4127899
vote_average 1.4127899
vote_count 1.4127899
avg_rating 0.0000000
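
As a rough illustration of how a completeness table like this can be computed, here is a minimal sketch in base R. The data frame `movies` and its values are hypothetical stand-ins, and we assume missing values are coded as `NA` (the real data may also contain empty strings that would need separate handling).

```r
# Hypothetical toy data standing in for the movie metadata.
movies <- data.frame(
  title    = c("A", "B", NA, "D"),
  homepage = c(NA, NA, NA, "http://example.com"),
  runtime  = c(90, 101, 114, NA),
  stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; the column means are the share of
# missing values, which we scale to a percentage.
pct_missing <- colMeans(is.na(movies)) * 100
print(pct_missing)
```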

Exploratory Data Analysis

We have data on movies from 1874 to 2020. The number of movies per year clearly increases over time.

Statistic Value
Min. 1874.000
1st Qu. 1979.000
Median 2001.000
Mean 1992.225
3rd Qu. 2010.000
Max. 2020.000
NA’s 724.000

What affects movie ratings?

We have two populations of user movie ratings: “avg_rating” corresponds to ratings from MovieLens users, and “vote_average” corresponds to ratings from TMDb users. Two extreme outliers have an average rating of 0 despite vote counts greater than 30. For simplicity, and to make the scatterplots easier to read, we exclude these two points.

Vote Average for TMDb Users Value
Min. 0.00000
1st Qu. 5.00000
Median 6.00000
Mean 5.65042
3rd Qu. 6.80000
Max. 10.00000
NA’s 642.00000

Vote Average for MovieLens Users Value
Min. 0.500000
1st Qu. 2.687500
Median 3.161670
Mean 3.060668
3rd Qu. 3.500000
Max. 5.000000

Vote Count Value
Min. 0.0000
1st Qu. 3.0000
Median 10.0000
Mean 111.5553
3rd Qu. 35.0000
Max. 14075.0000
NA’s 642.0000

Correlation Matrix

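Looking at the correlation matrix, none of these variables has a strong linear relationship with average movie ratings. The largest correlation for vote_average is 0.48, and that is with avg_rating, the other rating measure; every other correlation with either rating is 0.12 or smaller in absolute value. There may still be non-linear relationships.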

id budget popularity revenue runtime vote_average vote_count avg_rating year
id 1.00 -0.09 -0.06 -0.06 -0.09 0.00 -0.05 -0.05 0.32
budget -0.09 1.00 0.45 0.77 0.13 0.04 0.68 0.03 0.13
popularity -0.06 0.45 1.00 0.51 0.12 0.10 0.56 0.08 0.13
revenue -0.06 0.77 0.51 1.00 0.10 0.08 0.81 0.06 0.09
runtime -0.09 0.13 0.12 0.10 1.00 0.11 0.11 0.12 0.08
vote_average 0.00 0.04 0.10 0.08 0.11 1.00 0.12 0.48 -0.04
vote_count -0.05 0.68 0.56 0.81 0.11 0.12 1.00 0.11 0.11
avg_rating -0.05 0.03 0.08 0.06 0.12 0.48 0.11 1.00 -0.04
year 0.32 0.13 0.13 0.09 0.08 -0.04 0.11 -0.04 1.00
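
A correlation matrix of this kind can be sketched with base R's `cor()`. The toy data below is hypothetical, and the choice of `use = "pairwise.complete.obs"` is an assumption on our part, since the report does not say how missing values were handled.

```r
# Hypothetical toy data with a few missing values.
toy <- data.frame(
  budget  = c(10, 20, NA, 40, 50),
  revenue = c(25, 70, 90, NA, 160),
  runtime = c(95, 110, 88, 120, 101)
)

# Pearson correlations; each pair uses the rows where both values are present.
corr <- round(cor(toy, use = "pairwise.complete.obs"), 2)
print(corr)
```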

Density plots for popularity, vote count, runtime and average vote.

We assume that a reliable estimate of a movie’s true average rating requires at least 30 votes; 12,439 movies meet this criterion. A potential source of bias in this approach is that movies with very few votes may simply be poor or unpopular movies, which would explain both their low vote counts and their possibly low average ratings.
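
The 30-vote cutoff described here could be applied along these lines. This is only a sketch on hypothetical toy data; the data frame and the variable name `min_votes` are ours, not taken from the original analysis.

```r
# Hypothetical toy data.
toy <- data.frame(
  title        = c("A", "B", "C", "D"),
  vote_count   = c(3, 30, 250, NA),
  vote_average = c(8.0, 6.1, 7.2, 5.0),
  stringsAsFactors = FALSE
)

min_votes <- 30  # threshold taken from the text

# which() drops rows where vote_count is NA instead of returning NA rows.
reliable <- toy[which(toy$vote_count >= min_votes), ]
print(nrow(reliable))
```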

Are low vote counts associated with lower average ratings?

The extreme values of vote count make the actual trend hard to see.

Even after removing the extreme values of vote count, it’s still hard to see much of a relationship between vote count and vote average.

How has average vote changed over time?

There seems to be a slight negative relationship between average vote and the year the movie was released (the correlation with year is -0.04).

How does the runtime of a movie affect average votes?

There are 69 movies with runtime equal to zero.

Runtime Value
Min. 0.0000
1st Qu. 91.0000
Median 101.0000
Mean 103.7956
3rd Qu. 114.0000
Max. 877.0000
NA’s 4.0000

## # A tibble: 1 x 1
##       n
##   <int>
## 1    69
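
A count like this can be obtained in base R as sketched below, using hypothetical toy data. Recoding zero runtimes as missing is one possible way to handle them and is our assumption; the report itself only states the count.

```r
# Hypothetical toy data.
toy <- data.frame(runtime = c(0, 91, 101, 0, NA, 877))

# Count movies whose runtime is recorded as zero, ignoring NAs.
n_zero <- sum(toy$runtime == 0, na.rm = TRUE)

# Optional: treat a zero runtime as missing rather than as a real value.
toy$runtime_clean <- ifelse(toy$runtime == 0, NA, toy$runtime)
print(n_zero)
```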

The extreme values of runtime make any association hard to see. However, even after removing the outliers, there is little visible association between runtime and average movie rating.

Do average votes increase with popularity?

We are not sure how popularity is measured.

Popularity Value
Min. 0.002538
1st Qu. 4.002335
Median 6.505146
Mean 7.779560
3rd Qu. 9.698631
Max. 547.488298

Even with the removal of extreme outliers, there seems to be no relationship between movie popularity and average movie rating.

Word association in movies:

The three most common words are woman, relationship, and independent. The words in the table are stemmed, so independent appears as independ and nudity as nuditi.

Word Frequency
woman 3621
relationship 2094
independ 1952
base 1872
murder 1868
love 1543
music 1473
war 1451
nuditi 1428
sex 1074
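
A word-frequency table of this shape can be sketched in base R as follows. The input strings are hypothetical, which text field the words came from is not stated in the report, and this sketch skips the stemming step that the table above evidently went through.

```r
# Hypothetical keyword strings.
keywords <- c("woman director", "independent film", "woman",
              "murder", "independent film")

# Lower-case, split on anything that is not a letter, drop empty tokens.
words <- unlist(strsplit(tolower(keywords), "[^a-z]+"))
words <- words[nzchar(words)]

# Tabulate and sort from most to least frequent.
freq <- sort(table(words), decreasing = TRUE)
print(head(freq, 3))
```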

Breakdown of movies by cast gender:

How has revenue changed over time?

Here we have two different views of the association.

## # A tibble: 93 x 2
##     year      Mean
##    <int>     <dbl>
##  1  1915 11000000 
##  2  1921  2500000 
##  3  1923      623 
##  4  1924  1213880 
##  5  1925  1272550 
##  6  1927   325272.
##  7  1930     7940 
##  8  1931  4343790 
##  9  1932  1597000 
## 10  1933  6140500 
## # … with 83 more rows
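
A per-year mean like the one in this tibble can be computed with base R's `aggregate()`. The sketch below uses hypothetical toy values, and we do not know whether the original summary filtered out zero or missing revenues first.

```r
# Hypothetical toy data.
toy <- data.frame(
  year    = c(1915, 1921, 1921, 1923),
  revenue = c(11000000, 2000000, 3000000, 623)
)

# Mean revenue within each release year.
by_year <- aggregate(revenue ~ year, data = toy, FUN = mean)
names(by_year)[2] <- "Mean"
print(by_year)
```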

Revenue vs Popularity

Average budget, revenue, and profit:

Measure Mean (millions)
Budget 34.14421
Revenue 99.88446
Profit 65.74025
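
These three averages are consistent with profit being defined as revenue minus budget, which we assume is how it was calculated. A quick check using the values from the table:

```r
# Means copied from the table (in millions).
mean_budget  <- 34.14421
mean_revenue <- 99.88446

# If profit = revenue - budget for every movie, the mean profit equals
# the difference of the two means.
mean_profit <- mean_revenue - mean_budget
print(mean_profit)
```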

Correlations

popularity budgetM revenueM profitM runtime
popularity 1.0000000 0.2903461 0.4301078 0.4274831 0.0749828
budgetM 0.2903461 1.0000000 0.7218920 0.5724324 0.1661427
revenueM 0.4301078 0.7218920 1.0000000 0.9806458 0.1732649
profitM 0.4274831 0.5724324 0.9806458 1.0000000 0.1582931
runtime 0.0749828 0.1661427 0.1732649 0.1582931 1.0000000

Revenue vs Budget

Two main outliers:

title budgetM revenueM
Pirates of the Caribbean: On Stranger Tides 380 1046
Avatar 237 2788

Linear model for revenue vs. budget:

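Revenue = -3.34 + 3.02*(Budget), with both measured in millions. Each additional million of budget is associated with about 3.02 million more in revenue: you need to spend money to make money. The intercept is not significantly different from zero (p = 0.133), and budget explains about 52% of the variation in revenue (R-squared = 0.52).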

## 
## Call:
## lm(formula = revenueM ~ budgetM, data = metascrubdf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -678.58  -43.24   -6.87   18.09 2074.84 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.34092    2.22597  -1.501    0.133    
## budgetM      3.02322    0.04164  72.612   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 119.3 on 4845 degrees of freedom
## Multiple R-squared:  0.5211, Adjusted R-squared:  0.521 
## F-statistic:  5273 on 1 and 4845 DF,  p-value: < 2.2e-16
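
To illustrate how the fitted line is used, the sketch below plugs the reported coefficients into a small helper function. The name `predict_revenue` is ours and hypothetical. For Avatar (budget 237, revenue 2788 in the outlier table), the residual works out to roughly 2074.84, which is close to the maximum residual in the summary.

```r
# Coefficients copied from the lm() summary (units: millions).
intercept <- -3.34092
slope     <-  3.02322

# Predicted revenue (millions) for a given budget (millions).
predict_revenue <- function(budget_m) intercept + slope * budget_m

# A movie with a budget of 100 million.
pred_100 <- predict_revenue(100)

# Residual for Avatar: observed revenue minus fitted revenue.
avatar_resid <- 2788 - predict_revenue(237)

print(pred_100)
print(avatar_resid)
```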

Profit vs Popularity

The correlation is 0.43, so popularity is only moderately correlated with profit.

## [1] 0.4274831

Outliers: Minions (popularity) and Avatar (profit).

Profitability of genres

Action, Adventure, Comedy, and Drama top the list.

Genre profitability list:

Main Genre Profit
Action 74585
Adventure 56355
Comedy 45314
Drama 41911
Animation 25554
Science Fiction 12176
Fantasy 12056
Horror 11471
Family 9973
Thriller 9206
Crime 7377
Romance 5980
Mystery 2317
History 1292
Music 882
War 756
Western 708
Documentary 528
TV Movie 37
Foreign 17

Profit vs Budget

Profit and profit variability increase with budget. The movies in the top 10% by budget have the highest profit but also the most variability.
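
One way to compare profit level and spread between the top 10% of budgets and the rest is sketched below. The data is a hypothetical toy set, and the standard deviation is our choice of variability measure; the report does not say which one was used.

```r
# Hypothetical toy data (millions): 18 small films and 2 large ones.
toy <- data.frame(
  budgetM = c(1:18, 150, 200),
  profitM = c(rep(c(1, 3), 9), -50, 600)
)

# Budget at the 90th percentile; movies at or above it form the top 10%.
cutoff <- quantile(toy$budgetM, 0.9)
top <- toy$budgetM >= cutoff

# Mean and standard deviation of profit in each group.
avg    <- tapply(toy$profitM, top, mean)
spread <- tapply(toy$profitM, top, sd)
print(rbind(mean = avg, sd = spread))
```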

Runtime Information

The mean runtime is 111 minutes, with many outliers above the upper quartile.

## [1] 111.0433

Genre and Popularity

Popularity by genre corresponds with the genre profit ranking above.

Here are the top ten most profitable collections:

Collection profit
Star Wars Collection 6579
Harry Potter Collection 6427
James Bond Collection 5566
The Fast and the Furious Collection 4115
Transformers Collection 3401
Despicable Me Collection 3393
Pirates of the Caribbean Collection 3272
The Twilight Collection 2957
Ice Age Collection 2788
Jurassic Park Collection 2653